Abstract

In recent years, a considerable amount of work has been devoted to generalizing linear
discriminant analysis to overcome its incompetence for high-dimensional classification (Witten
& Tibshirani 2011, Cai & Liu 2011, Mai et al. 2012, Fan et al. 2012). In this paper, we develop
high-dimensional semiparametric sparse discriminant analysis (HD-SeSDA) that generalizes
the normal-theory discriminant analysis in two ways: it relaxes the Gaussian assumptions and
can handle non-polynomial (NP) dimension classification problems. If the underlying Bayes
rule is sparse, HD-SeSDA can estimate the Bayes rule and select the true features simultaneously
with overwhelming probability, as long as the logarithm of dimension grows slower than
the cube root of sample size. Simulated and real examples are used to demonstrate the finite
sample performance of HD-SeSDA. At the core of the theory is a new exponential concentration
bound for semiparametric Gaussian copulas, which is of independent interest.

Keywords: Gaussian copulas, Linear discriminant analysis, NP-dimension asymptotics,
Semiparametric model.
1 Introduction



Despite its simplicity, linear discriminant analysis (LDA) has proved to be a valuable classifier in
many applications (Michie et al. 1994, Hand 2006). Let $X = (x_1, \ldots, x_p)$ denote the predictor
vector and $Y \in \{+1, -1\}$ be the class label. The LDA model states that $X \mid Y \sim N(\mu_Y, \Sigma)$,
yielding the Bayes rule

$$\hat Y^{\mathrm{Bayes}} = \mathrm{sign}\left\{ \left(X - \frac{\mu_+ + \mu_-}{2}\right)^T \Sigma^{-1} (\mu_+ - \mu_-) + \log\frac{\pi_+}{\pi_-} \right\},$$

where $\pi_y = \Pr(Y = y)$. Given $n$ observations $(Y^i, X^i)$, $1 \le i \le n$, the classical LDA classifier
estimates the Bayes rule by substituting $\Sigma$, $\mu_y$ and $\pi_y$ with their sample estimates. As is well
known, the classical LDA fails to cope with high-dimensional data where the dimension, $p$, can
be much larger than the sample size, $n$. A considerable amount of work has been devoted to
generalizing LDA to meet the high-dimensional challenges. It is generally agreed that effectively
exploiting sparsity is a key to the success of a generalized LDA classifier for high-dimensional
data. Early attempts include the nearest shrunken centroids classifier (NSC) (Tibshirani et al. 2002)
and later the features annealed independence rule (FAIR) (Fan & Fan 2008). These two methods
basically follow the diagonal LDA paradigm with an added variable selection component, where
correlations among variables are completely ignored. Recently, more sophisticated sparse LDA
methods have been proposed; see Trendafilov & Jolliffe (2007), Wu et al. (2008), Clemmensen
et al. (2011), Witten & Tibshirani (2011), Mai et al. (2012), Shao et al. (2011), Cai & Liu (2011)
and Fan et al. (2012). In these papers, a lot of empirical and theoretical results have been provided
to demonstrate the competitive performance of sparse LDA for high-dimensional classification.
These research efforts are rejuvenating discriminant analysis.
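As a small illustration of the Bayes rule above, the following Python sketch evaluates it for known parameters in two dimensions. The function name `lda_bayes_rule_2d` and the toy numbers are assumptions made for this example only; the formula is the one stated in the text, with the 2 x 2 inverse written out in closed form.

```python
import math


def lda_bayes_rule_2d(x, mu_plus, mu_minus, sigma, pi_plus):
    """Evaluate the two-class LDA Bayes rule for p = 2 with known parameters."""
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    d1 = mu_plus[0] - mu_minus[0]
    d2 = mu_plus[1] - mu_minus[1]
    # beta = Sigma^{-1} (mu_+ - mu_-), using the closed-form 2x2 inverse
    beta = ((s22 * d1 - s12 * d2) / det, (s11 * d2 - s21 * d1) / det)
    # centre x at the midpoint of the two class means
    c1 = x[0] - (mu_plus[0] + mu_minus[0]) / 2
    c2 = x[1] - (mu_plus[1] + mu_minus[1]) / 2
    score = c1 * beta[0] + c2 * beta[1] + math.log(pi_plus / (1 - pi_plus))
    return 1 if score > 0 else -1


# Toy parameters (hypothetical): correlated features, means differ only in x_1.
sigma = ((1.0, 0.5), (0.5, 1.0))
mu_plus, mu_minus = (1.0, 0.0), (-1.0, 0.0)
print(lda_bayes_rule_2d((0.1, 1.0), mu_plus, mu_minus, sigma, pi_plus=0.5))
```

In this toy setting the second feature has identical class means but still receives a nonzero coefficient through its correlation with the first, which is the kind of information a diagonal rule ignores.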

However, the existing sparse LDA methods still require the Gaussian data assumption, at least
in theory. Empirical evidence given in Section 5.1 shows that sparse LDA methods become
ineffective for non-normal data. In lower-dimensional classification problems, some researchers
have considered ways to relax the Gaussian distribution assumption. For example, Hastie &
Tibshirani (1996) proposed the mixture discriminant analysis (MDA) that uses a mixture of Gaussian
distributions to model the conditional densities of variables given the class label. MDA is estimated
by the Expectation-Maximization algorithm. MDA is a nonparametric generalization of LDA, but
it is not clear how to further extend MDA to the high-dimensional classification setting with the
ability to do variable selection. Lin & Jeon (2003) proposed another interesting approach to
relaxing the Gaussian data assumption in LDA. Their approach starts with the assumption that, through
a set of unknown monotone univariate transformations, the observed data follow the LDA model,
and hence the new model is called the semiparametric LDA model (SeLDA). Lin & Jeon (2003)
further showed that the unknown transformations can be accurately estimated and thus the SeLDA
model can be estimated when $p$ is fixed and $n$ goes to infinity. With the consistently estimated
transformation, one can transform the data and fit an LDA model. However, the estimator in Lin &
Jeon (2003) cannot handle high-dimensional classification problems, especially when $p$ exceeds $n$.

In this paper, we develop high-dimensional semiparametric sparse discriminant analysis
(HD-SeSDA), a generalization of SeLDA for high-dimensional classification and variable selection. In
particular, we propose a new estimator for the transformation function and establish its uniform
consistency property as long as the logarithm of $p$ is smaller than the cube root of $n$. With the new
transformation estimator, we can transform the data and fit a sparse LDA classifier. In this work we
use the direct sparse discriminant analysis (DSDA) developed by Mai et al. (2012). HD-SeSDA
enjoys great computational efficiency: its computational complexity grows linearly with $p$. We
show that, if the Bayes rule of the SeLDA model is sparse, then HD-SeSDA can consistently select
the important variables and estimate the Bayes rule. At the core of the theory is an exponential
concentration bound for semiparametric Gaussian copulas, which is of independent interest.

The rest of this paper is organized as follows. The semiparametric LDA model is introduced 
in Section 2, and the methodological details of HD-SeSDA are introduced in Section 3. Statistical 
theory is presented in Section 4. Numerical examples are shown in Section 5 to demonstrate the 
finite sample performance of HD-SeSDA. Technical proofs are relegated to the appendix. 

2 Semiparametric LDA Model 

Consider the binary classification problem where we have observed $n$ random pairs $(Y^i, X^i)$,
$1 \le i \le n$, and wish to classify $Y$ using a function of $X$. Lin & Jeon (2003) proposed the following
semiparametric LDA (SeLDA) model that assumes that

$$(h_1(X_1), \ldots, h_p(X_p)) \mid Y \sim N(\mu_Y, \Sigma),$$

where $h = (h_1, \ldots, h_p)$ is a set of strictly monotone univariate transformations. It is important
to note that the SeLDA model does not assume that these univariate transformations are known
or have any parametric forms. By properties of the Gaussian distribution, $h$ is only unique up to
location and scale shifts. Therefore, for identifiability, assume that $\mu_+ = 0$ and $\Sigma_{jj} = 1$, $1 \le j \le p$.
The Bayes rule of the SeLDA model is

$$\hat Y^{\mathrm{Bayes}} = \mathrm{sign}\left\{ \left(h(X) - \frac{\mu_+ + \mu_-}{2}\right)^T \Sigma^{-1} (\mu_+ - \mu_-) + \log\frac{\pi_+}{\pi_-} \right\}.$$

The SeLDA model is a very natural generalization of the LDA model. It is equivalent to
modeling the within-group distributions with semiparametric Gaussian copulas. For any continuous
univariate random variable $W$, we have

$$\Phi^{-1} \circ F(W) \sim N(0, 1),$$

where $F$ is the cumulative distribution function (CDF) of $W$ and $\Phi$ is the CDF of the standard
normal distribution. The Gaussian copula is a multivariate generalization of this simple
univariate fact. Semiparametric Gaussian copulas have generated a lot of research interest in recent
years; see Klaassen & Wellner (1997), Song (2000), Tsukahara (2003), Chen & Fan (2006) and
Chen et al. (2006). The SeLDA model is the first application of the semiparametric Gaussian copula
in the context of classification. The SeLDA model can be viewed as an additive model by adding
semiparametrically specified two-way interactions (Lin & Jeon 2003). The additive model (Hastie
& Tibshirani 1990) assumes that the log-likelihood takes the following form:

$$\log f_y(X) = \psi_{y0} + \sum_{j=1}^{p} \psi_{yj}(X_j). \tag{1}$$

When interactions are not negligible, the additive model (1) may not be sufficient. SeLDA tries to
model the interaction effects. Define $\Omega = (\omega_{jk}) = \Sigma^{-1}$; then the SeLDA model can be written
as

$$\log f_y(X) = \psi_{y0} + \sum_{j=1}^{p} \psi_{yj}(X_j) + \sum_{j<k} \psi_{y,jk}(X_j, X_k), \tag{2}$$

where

$$\psi_{yj}(X_j) = -\frac{\omega_{jj}\,(h_j(X_j) - \mu_{yj})^2}{2} + \log |h_j'(X_j)|,$$

$$\psi_{y,jk}(X_j, X_k) = -\omega_{jk}\,(h_j(X_j) - \mu_{yj})(h_k(X_k) - \mu_{yk}).$$

In the SeLDA model, the main effects in (2) are as general as those in the additive model (1),
because any univariate continuous random variable follows the semiparametric normal distribution,
while the two-way interactions are semiparametrically specified to strike a balance between
flexibility and computation cost. For more detailed discussion on the connection between the
semiparametric model and nonparametric models, the readers are referred to Lin & Jeon (2003).

In light of (2), the SeLDA model can be estimated in the low-dimensional setting. The basic
idea is straightforward: we first find $\hat h_j(\cdot)$ as good estimates of these univariate transformation
functions and then fit the LDA model on the "pseudo data" $(Y^i, \hat h(X^i))$, $1 \le i \le n$. To be more
specific, to find $\hat h_j$, we let $F_{+j}$, $F_{-j}$ be the CDF of $X_j$ conditional on $Y = +1$ and $Y = -1$,
respectively, and then we have

$$h_j = \Phi^{-1} \circ F_{+j} = \Phi^{-1} \circ F_{-j} + \mu_{-j}.$$

It can be seen that we only need an estimate of $F_{+j}$. For convenience, denote $X_{yj}$ as the $j$th entry
of an observation $X$ belonging to the group $Y = y$, and $\tilde F_{+j}$ as the empirical CDF of $X_{+j}$. Note
that we cannot directly plug in $\tilde F_{+j}$ so that $\hat h_j = \Phi^{-1} \circ \tilde F_{+j}$, because infinite values would occur
at tails. Instead, $\tilde F_{+j}$ is Winsorized at a predefined pair of numbers $(a, b)$ to obtain

$$\hat F^{a,b}_{+j}(x) = \begin{cases} b & \text{if } \tilde F_{+j}(x) > b; \\ \tilde F_{+j}(x) & \text{if } a \le \tilde F_{+j}(x) \le b; \\ a & \text{if } \tilde F_{+j}(x) < a. \end{cases} \tag{3}$$

Then

$$\hat h_j = \Phi^{-1} \circ \hat F^{a,b}_{+j}. \tag{4}$$

The Winsorization can be viewed as a bias-variance trade-off. If necessary, one could interchange
the group labels so that $n_+ > n_-$ to obtain more efficient estimates, where $n_y$ is the within-group
sample size.



With $\hat h_j$, the covariance matrix $\Sigma$ is estimated by the pooled sample covariance matrix of $\hat h(X^i)$,
and $\mu_{-j}$ is estimated by

where <p is the density function for a standard normal random variable and 

1 p 

q= ~Z^ 1 ^+ J (^)e(a,b)- 



$\hat\mu_{-j}$ has this complicated form because of the Winsorization. Lin & Jeon (2003) showed that when
$p$ is fixed and $n$ tends to infinity, $\hat\Sigma$ and $\hat\mu_-$ are consistent.
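The Winsorized transformation estimator in (3) and (4) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `winsorized_transform`, the levels $(a, b) = (0.01, 0.99)$ and the log-normal toy sample are all assumptions chosen for illustration. For log-normal data the true transformation is the logarithm, so the estimate can be compared against it.

```python
import bisect
import math
import random
from statistics import NormalDist


def winsorized_transform(sample, a, b):
    """Return h_hat = Phi^{-1} o F_hat^{a,b}, with F_hat the clipped empirical CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()

    def h_hat(x):
        # empirical CDF: fraction of sample points that are <= x
        f = bisect.bisect_right(ordered, x) / n
        # Winsorize at (a, b) as in (3) so that Phi^{-1} stays finite
        f = min(max(f, a), b)
        # apply the standard normal quantile function as in (4)
        return std_normal.inv_cdf(f)

    return h_hat


# Toy check (hypothetical data): X = exp(Z) with Z ~ N(0, 1), so h(x) = log(x).
rng = random.Random(0)
sample = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(2000)]
h_hat = winsorized_transform(sample, a=0.01, b=0.99)
for x in (1.0, math.e, 1e9):
    print(x, h_hat(x))
```

Beyond the largest observation the empirical CDF equals one, and the clipping step is what keeps the transformed value finite there.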

3 Estimation of The High-dimensional Semiparametric LDA 
Model 

There are two fundamental difficulties in applying SeLDA to high-dimensional classification. First,
Lin & Jeon (2003) justified their estimates of the transformation functions but their asymptotic
theory only works for the fixed $p$ setting. We show later that, in order to obtain good estimators of $p$
transformation functions uniformly, we shall modify the estimators defined in (3) and (4). Second,
the second stage of SeLDA estimation is just the ordinary LDA, which is infeasible for
high-dimensional problems, even when we know the true transformation functions. To overcome this
difficulty, we propose to fit a sparse SeLDA model by exploiting a sparsity assumption on the Bayes
rule. For the sake of presentation, we first discuss how to fit a sparse SeLDA model, provided that
good estimators of $h_j(\cdot)$, $1 \le j \le p$, are already obtained. After introducing the sparse SeLDA,
we focus on a new strategy to estimate $h_j(\cdot)$, $1 \le j \le p$.

3.1 Exploiting sparsity 

We assume that the Bayes rule of the SeLDA model only involves a small number of predictors. To
be more specific, let $\beta^{\mathrm{Bayes}} = \Sigma^{-1}(\mu_+ - \mu_-)$ and define $A = \{j : \beta^{\mathrm{Bayes}}_j \ne 0\}$. Sparsity means that
$|A| \ll p$. An elegant feature of SeLDA is that it keeps the interpretation of LDA, that is, variable
$j$ is irrelevant if and only if $\beta^{\mathrm{Bayes}}_j = 0$.

Suppose that we have obtained $\hat h_j(\cdot)$ as a good estimate of $h_j(\cdot)$, $1 \le j \le p$. We then focus on
estimating the sparse LDA model using the "pseudo data" $(Y^i, \hat h(X^i))$, $1 \le i \le n$. Among the
previously mentioned sparse LDA proposals in the literature, only Fan & Fan (2008), Shao et al.
(2011), Cai & Liu (2011), Mai et al. (2012) and Fan et al. (2012) provided theoretical analysis
of their methods. The theory of Fan & Fan (2008) assumes that $\Sigma$ is a diagonal matrix. The method
of Shao et al. (2011) works well only under some strong sparsity assumptions on the covariance matrix
$\Sigma$ and $\mu_+ - \mu_-$. The sparse LDA methods proposed in Cai & Liu (2011), Mai et al. (2012) and
Fan et al. (2012) are shown to work well under general correlation structures. From the theoretical
perspective, the methods in Cai & Liu (2011), Mai et al. (2012) and Fan et al. (2012) are the most
competitive among the existing sparse LDA proposals. We briefly review these three sparse LDA
methods in the sequel.

The linear programming discriminant (LPD) rule proposed by Cai & Liu (2011) starts from the
observation that the true $\beta^{\mathrm{Bayes}}$ must satisfy

$$\Sigma\beta^{\mathrm{Bayes}} - (\mu_+ - \mu_-) = 0. \tag{5}$$

Therefore, when $\Sigma$, $\mu_+$, $\mu_-$ are unknown, LPD finds a sparse approximation of $\beta$ via a formula
similar to the Dantzig selector in linear regression (Candes & Tao 2007). It estimates $\beta$ by

$$\hat\beta^{\mathrm{LPD}} = \arg\min_{\beta} \sum_{j=1}^{p} |\beta_j|, \quad \text{s.t. } \|\hat\Sigma\beta - (\hat\mu_+ - \hat\mu_-)\|_\infty \le \lambda. \tag{6}$$

Cai & Liu (2011) solved LPD by the primal-dual interior-point method.

Regularized optimal affine discriminant (ROAD) (Fan et al. 2012) and sparse linear
discriminant analysis (sLDA) (Wu et al. 2008) use slightly different yet essentially identical formulas.
They are derived from the fact that

$$\arg\max_{\beta} \frac{\beta^T \pi_+\pi_- (\mu_+ - \mu_-)(\mu_+ - \mu_-)^T \beta}{\beta^T \Sigma \beta} \tag{7}$$

yields the same discriminant direction as minimizing $\beta^T \Sigma \beta$ subject to $(\mu_+ - \mu_-)^T \beta = c$,
for any positive constant $c$. Therefore, ROAD estimates the discriminant direction by

$$\hat\beta^{\mathrm{ROAD}} = \arg\min_{\beta} \beta^T \hat\Sigma \beta, \quad \text{s.t. } (\hat\mu_+ - \hat\mu_-)^T \beta / 2 = 1 \text{ and } \sum_{j=1}^{p} |\beta_j| \le \tau. \tag{8}$$

The formula for sLDA is identical except that it replaces $(\hat\mu_+ - \hat\mu_-)^T \beta / 2 = 1$ with
$(\hat\mu_+ - \hat\mu_-)^T \beta = 1$. Both Fan et al. (2012) and Wu et al. (2008) proposed efficient
algorithms for computing (8).

Mai et al. (2012) proposed the direct sparse discriminant analysis (DSDA) to estimate the
direction of the Bayes rule, which is equivalent to estimating the Bayes rule in terms of classification
and variable selection. It is motivated by the fact that the classical LDA direction can be exactly
recovered by doing linear regression of $Y$ on $h(X)$ (Hastie et al. 2008) where $Y$ is treated as a
numeric variable. In other words,

$$\beta^{\mathrm{Bayes}} \propto \arg\min_{\beta} \sum_{i=1}^{n} \left(Y^i - \beta_0 - h(X^i)^T \beta\right)^2. \tag{9}$$

Define $C = \mathrm{Cov}(h(X))$, and $\beta^* = C^{-1}(\mu_+ - \mu_-)$. It can be shown that $\beta^*$ and $\beta^{\mathrm{Bayes}}$ have the
same direction. DSDA aims at estimating $\beta^*$ by the following penalized least squares approach
(Mai et al. 2012):

$$\hat\beta^{\mathrm{DSDA}} = \arg\min_{\beta} \left\{ n^{-1} \sum_{i=1}^{n} \left(Y^i - \beta_0 - h(X^i)^T \beta\right)^2 + \sum_{j=1}^{p} P_\lambda(|\beta_j|) \right\},$$

$$\hat\beta_0^{\mathrm{DSDA}} = -\frac{(\hat\mu_+ + \hat\mu_-)^T \hat\beta^{\mathrm{DSDA}}}{2} + \frac{(\hat\beta^{\mathrm{DSDA}})^T \hat\Sigma \hat\beta^{\mathrm{DSDA}}}{(\hat\mu_+ - \hat\mu_-)^T \hat\beta^{\mathrm{DSDA}}} \log\frac{\hat\pi_+}{\hat\pi_-}, \tag{10}$$

where, under the LDA model, $h$ is known to be $h(X) = X$, and $P_\lambda(\cdot)$ is a sparsity-inducing
penalty, such as Lasso (Tibshirani 1996) or SCAD (Fan & Li 2001). Then the DSDA classifier
is $\mathrm{sign}(\hat\beta_0^{\mathrm{DSDA}} + h(X)^T \hat\beta^{\mathrm{DSDA}})$. There are many other penalty functions proposed for sparse
regression, including the elastic net (Zou & Hastie 2005), the adaptive lasso (Zou 2006), SICA (Lv
& Fan 2009) and the MCP (Zhang 2010), among others. All these penalties can be used in DSDA.
The original paper (Mai et al. 2012) used the Lasso penalty where $P_\lambda(t) = \lambda t$ for $t \ge 0$. One could
use either lars (Efron et al. 2004) or glmnet (Friedman et al. 2008) to efficiently implement
DSDA.
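To make the penalized least squares step concrete, here is a minimal pure-Python sketch of DSDA with the Lasso penalty, solved by plain coordinate descent rather than by lars or glmnet. The function names, the toy data and the value $\lambda = 0.4$ are assumptions for illustration. The intercept follows one reading of the second line of (10), using the identity that $\hat\beta^T \hat\Sigma \hat\beta$ equals the pooled variance of the projected data.

```python
import math
import random


def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def lasso_direction(xs, ys, lam, n_sweeps=100):
    """Coordinate descent for (1/n) * sum (y - b0 - x'b)^2 + lam * sum |b_j|."""
    n, p = len(xs), len(xs[0])
    cols = [[row[j] for row in xs] for j in range(p)]
    beta = [0.0] * p
    resid = list(ys)  # residuals y - b0 - x'b, starting from b0 = 0 and b = 0
    for _ in range(n_sweeps):
        # the unpenalized intercept absorbs the mean residual
        mean_r = sum(resid) / n
        resid = [r - mean_r for r in resid]
        for j, col in enumerate(cols):
            scale = sum(c * c for c in col) / n
            if scale == 0.0:
                continue
            # inner product of column j with its partial residual, divided by n
            rho = sum(c * r for c, r in zip(col, resid)) / n + scale * beta[j]
            new = soft_threshold(rho, lam / 2) / scale
            step = new - beta[j]
            if step != 0.0:
                resid = [r - c * step for r, c in zip(resid, col)]
                beta[j] = new
    return beta


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dsda_intercept(xs, ys, beta):
    """Intercept as read from (10); requires beta to be nonzero."""
    plus = [dot(x, beta) for x, y in zip(xs, ys) if y == 1]
    minus = [dot(x, beta) for x, y in zip(xs, ys) if y == -1]
    m_plus, m_minus = sum(plus) / len(plus), sum(minus) / len(minus)
    # beta' Sigma_hat beta equals the pooled variance of the projected data
    pooled = (sum((v - m_plus) ** 2 for v in plus)
              + sum((v - m_minus) ** 2 for v in minus)) / (len(ys) - 2)
    gap = m_plus - m_minus  # (mu_hat_+ - mu_hat_-)' beta
    return -(m_plus + m_minus) / 2 + pooled / gap * math.log(len(plus) / len(minus))


# Toy data (hypothetical): one informative feature followed by four noise features.
rng = random.Random(1)
ys = [1] * 100 + [-1] * 100
xs = [[1.5 * y + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(4)] for y in ys]
beta = lasso_direction(xs, ys, lam=0.4)
b0 = dsda_intercept(xs, ys, beta)
print(beta, b0)
```

In HD-SeSDA the same routine would be applied to the transformed features $\hat h(X^i)$ instead of the raw $X^i$.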



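If we knew these transformation functions $h$ in the SeLDA model, (10) could be directly used
to estimate the Bayes rule of SeLDA. In HD-SeSDA we substitute $h_j$ with its estimator $\hat h_j$ and
apply sparse LDA methods to $(Y, \hat h(X))$. For example, to use DSDA in the SeLDA model, we
solve for

$$\hat\beta = \arg\min_{\beta} \left\{ n^{-1} \sum_{i=1}^{n} \left(Y^i - \beta_0 - \hat h(X^i)^T \beta\right)^2 + \sum_{j=1}^{p} P_\lambda(|\beta_j|) \right\},$$

$$\hat\beta_0 = -\frac{(\hat\mu_+ + \hat\mu_-)^T \hat\beta}{2} + \frac{\hat\beta^T \hat\Sigma \hat\beta}{(\hat\mu_+ - \hat\mu_-)^T \hat\beta} \log\frac{\hat\pi_+}{\hat\pi_-}. \tag{11}$$

Then (11) yields the HD-SeSDA classification rule: $\mathrm{sign}(\hat\beta_0 + \hat h(X)^T \hat\beta)$.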



Remark 1. After using Theorem 1 to transform the data, one can use any of the aforementioned
sparse discriminant analysis techniques to build a semiparametric sparse discriminant analysis
method. Among these methods, LPD, ROAD and DSDA enjoy strong theoretical justifications
without assuming a strong structure assumption on the variables. Under proper conditions, LPD
is consistent in the sense that it achieves the Bayes error rate asymptotically. Fan et al. (2012)
also proved that the classification error rate of ROAD converges to the Bayes error rate with
overwhelming probability. On the other hand, Mai et al. (2012) showed that if the Bayes rule is
sparse, DSDA can discover the true subset of variables and consistently estimate the Bayes rule
with overwhelming probability.

Moreover, it has been recently discovered that these sparse discriminant analysis
techniques are closely connected. In another paper which is accepted for publication, the authors
showed that sLDA and DSDA are equivalent in the sense that for proper sequences of $\lambda$ they give
the same set of directions. Because sLDA and ROAD are identical methods, that result basically
implies a certain equivalence between ROAD and DSDA. To see a connection between LPD and
DSDA, let us consider a constrained version of DSDA by following the analogy of the Dantzig selector
and Lasso least squares:

$$\arg\min_{\beta} \sum_{j=1}^{p} |\beta_j|, \quad \text{s.t. } \|\hat\Sigma_{h(X)} \beta - (\hat\mu_+ - \hat\mu_-)\|_\infty \le \lambda.$$

Note that the above formulation is very similar to LPD, except that we now use the covariance of
predictors, not the within-class covariance matrix. It is now well known that the Dantzig selector
and Lasso least squares are very similar in general (Bickel et al. 2009, James et al. 2009). Under
certain conditions, the two give identical solution paths (James et al. 2009). Based on these results,
it is natural to expect that DSDA and LPD have similar performance.

Due to the above considerations, we stick with HD-SeSDA in the rest of the paper to illustrate
the theory and application of high-dimensional semiparametric sparse discriminant analysis.



3.2 Uniform estimation of transformation functions 

We propose a high-quality estimator of the monotone transformation function. In order to establish
the theoretical property of HD-SeSDA, we need all $p$ estimators of the transformation function to
uniformly converge to the truth at a certain fast rate, even when $p$ is much larger than $n$. Our
estimator is defined as

$$\hat F_{+j}(x) = \begin{cases} 1 - 1/n_+^2 & \text{if } \tilde F_{+j}(x) > 1 - 1/n_+^2; \\ \tilde F_{+j}(x) & \text{if } 1/n_+^2 \le \tilde F_{+j}(x) \le 1 - 1/n_+^2; \\ 1/n_+^2 & \text{if } \tilde F_{+j}(x) < 1/n_+^2, \end{cases} \tag{12}$$

and then

$$\hat h_j = \Phi^{-1} \circ \hat F_{+j}.$$

In other words, instead of fixing the Winsorization parameters $a, b$ as in (3), we let

$$(a, b) = (a_n, b_n) = \left(\frac{1}{n_+^2},\; 1 - \frac{1}{n_+^2}\right). \tag{13}$$

With the presence of $\Phi^{-1}$, it is necessary to choose $a_n > 0$, $b_n < 1$ to avoid extreme values at tails.
On the other hand, $a_n \to 0$, $b_n \to 1$ so that the bias will automatically vanish as $n \to \infty$. The theory
developed in Section 4 provides the mathematical justification that (13) are proper choices of
$a_n$, $b_n$.

Other estimators have been proposed. For example, Liu et al. (2009) considered a one-class
problem with Gaussian copulas, which essentially states $h(X) \sim N(0, \Sigma)$, and aims to estimate
$\Sigma^{-1}$. In their paper, $h_j$ is estimated by $\hat h_j = \Phi^{-1} \circ \hat F^{a_n, b_n}_j$, where

$$a_n = 1 - b_n = \frac{1}{4 n^{1/4} \sqrt{\pi \log n}}.$$

Liu et al. (2009) showed that this estimator is consistent when $p$ is smaller than any polynomial
order of $n$, but it is not clear whether the final HD-SeSDA can handle non-polynomial high
dimensions. Rank-based estimators were independently proposed by Liu et al. (2012) and Xue & Zou
(2012) for estimating $\Sigma^{-1}$ without estimating the transformation functions. However, in the
discriminant analysis problem considered here we need to estimate both $\Sigma^{-1}$ and the mean vectors.
The estimation of the mean vectors requires estimating the transformation functions.
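The difference between the two truncation schemes is easy to see numerically. The sketch below compares the lower Winsorization level of Liu et al. (2009) quoted above with a level of $1/n_+^2$, which is an assumption about the intended reading of (13), and reports the largest value $\Phi^{-1}(1 - a_n)$ that the estimated transformation can take under each choice. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def level_growing(n_plus):
    """Lower Winsorization level a_n = 1 / n_+^2 (with b_n = 1 - a_n)."""
    return 1.0 / n_plus ** 2


def level_liu(n):
    """Lower level a_n = 1 / (4 n^{1/4} sqrt(pi log n)) (with b_n = 1 - a_n)."""
    return 1.0 / (4.0 * n ** 0.25 * math.sqrt(math.pi * math.log(n)))


def largest_value(a_n):
    """Largest value the estimated transformation can take: Phi^{-1}(1 - a_n)."""
    return NormalDist().inv_cdf(1.0 - a_n)


for n in (50, 200, 1000):
    a_new, a_liu = level_growing(n), level_liu(n)
    print(n, a_new, largest_value(a_new), a_liu, largest_value(a_liu))
```

For these sample sizes the $1/n_+^2$ level is smaller, so less of the empirical CDF is clipped and the bias from truncation is smaller, at the price of allowing more extreme transformed values.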



4 Theoretical Results 

4.1 Estimation of transformation functions 

To explore the consistency property of HD-SeSDA, we first study the estimation accuracy of
semiparametric Gaussian copulas. The results in this subsection are applicable to any statistical
model using semiparametric Gaussian copulas, and are of independent interest. Consider
the one-class estimation case first. Assume that $X$ is a $p$-dimensional random variable such that
$h(X) \sim N(0_p, \Sigma)$ with $h_j = \Phi^{-1} \circ F_j$ and $\hat h_j = \Phi^{-1} \circ \hat F_j$, where $\hat F_j$ is defined as in (12). Denote
$\hat\mu_j$ and $\hat\sigma_{jk}$ as the sample mean and sample covariance for the corresponding features. We establish
exponential concentration bounds for $\hat\mu_j$ and $\hat\sigma_{jk}$.

Theorem 1. Define 



,e 2 



die) = 2exp(-^)H-4e X p(-cn--)H-4exp(-cn^) 

Q(e) = cexp(— cne 2 ) + cexp(— ens p ) + cexp(— en 1 p ) 

n l -Pe 2 . 
+cexp(-c 2 ) 

p z log n 

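where $c$ is a generic positive constant. For sufficiently large $n$ and any $0 < \rho < 1/3$, there exists a
positive constant $\epsilon_0$ such that, for any $0 < \epsilon < \epsilon_0$, we have

$$\Pr(|\hat\mu_j - \mu_j| > \epsilon) \le \zeta_1^*(\epsilon), \tag{14}$$

$$\Pr(|\hat\sigma_{jk} - \sigma_{jk}| > \epsilon) \le \zeta_2^*(\epsilon). \tag{15}$$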



Remark 2. Semiparametric Gaussian copulas have been used by Liu et al. (2009) to develop a
semiparametric graphical model. They derived some probability bounds concerning $\hat\mu_j$ and $\hat\sigma_{jk}$
as well, but they required that $p$ should grow at a polynomial order of $n$. Our results are much
stronger because $p$ can be as large as $\exp(n^{1/3 - \rho})$ for any $0 < \rho < 1/3$. Theorem 1 and its proof
can be used for other high-dimensional statistical problems involving semiparametric Gaussian
copulas.

For the two-class SeLDA model, we can easily obtain the following corollary from Theorem 1.

Corollary 1. Define 

Ci(c) = Ci*(^) + Ci*(^p)+4exp(-cn) (16) 

C 2 (e) = C 2 l^) + C2(^p)+4exp(-cn) + 2C 1 (e) (17) 

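Then there exists a positive constant $\epsilon_0$ such that, for any $0 < \epsilon < \epsilon_0$, we have

$$\Pr\left(|(\hat\mu_{+j} - \hat\mu_{-j}) - (\mu_{+j} - \mu_{-j})| > \epsilon\right) \le \zeta_1(\epsilon),$$

$$\Pr\left(|\hat\sigma_{jk} - \sigma_{jk}| > \epsilon\right) \le \zeta_2(\epsilon).$$

Corollary 1 is the fundamental result for establishing the rate of convergence of HD-SeSDA.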

4.2 Consistency of HD-SeSDA 

With the results in Section 4.1, we are ready to prove the rate of convergence of HD-SeSDA. We
first define necessary notation. Define $\beta^* = C^{-1}(\mu_+ - \mu_-)$. Recall that $\beta^*$ is equal to
$c\Sigma^{-1}(\mu_+ - \mu_-) = c\beta^{\mathrm{Bayes}}$ for some positive constant $c$ (Mai et al. 2012). Then we can write
$A = \{j : \beta^*_j \ne 0\}$. Let $s$ be the cardinality of $A$. In addition, for an $m_1 \times m_2$ matrix $M$, denote
$\|M\|_\infty = \max_{i=1,\ldots,m_1} \sum_{j=1}^{m_2} |M_{ij}|$, and, for a vector $u$, $\|u\|_\infty = \max_j |u_j|$. Throughout the
proof, we assume that $s \ll n^{1/4}$. Define the following quantities that are repeatedly used:

$$\kappa = \|C_{A^cA}(C_{AA})^{-1}\|_\infty, \quad \varphi = \|(C_{AA})^{-1}\|_\infty, \quad \Delta = \|\mu_{+A} - \mu_{-A}\|_\infty, \quad \nu = \frac{\min_{j \in A} |\beta^*_j|}{\Delta\varphi}.$$

Suppose that the lasso estimator correctly shrinks $\hat\beta_{A^c}$ to zero; then HD-SeSDA should be
equivalent to performing SeLDA on $X_A$. Therefore, define the hypothetical estimator

$$\hat\beta^{\mathrm{hyp}}_A = \arg\min_{\beta, \beta_0} \left\{ n^{-1} \sum_{i=1}^{n} \Big(Y^i - \beta_0 - \sum_{j \in A} \hat h_j(X^i_j)\beta_j\Big)^2 + \sum_{j \in A} \lambda|\beta_j| \right\}.$$

Then, we wish that $\hat\beta = (\hat\beta^{\mathrm{hyp}}_A, 0_{A^c})$ with $\hat\beta_j \ne 0$ for $j \in A$. To ensure the consistency of
HD-SeSDA, we further require the following condition:

$$\kappa = \|C_{A^cA}(C_{AA})^{-1}\|_\infty < 1. \tag{18}$$

The condition in (18) is an analogue of the irrepresentable condition for the lasso penalized
linear regression model (Meinshausen & Bühlmann 2006, Zou 2006, Zhao & Yu 2006, Wainwright
2009).
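Condition (18) can be checked numerically for a given covariance matrix and active set. The sketch below computes $\kappa$ as the maximum absolute row sum of $C_{A^cA}(C_{AA})^{-1}$ using a small Gaussian elimination routine. As a simplifying assumption, it plugs in an AR(0.5) matrix directly for $C$; in the paper $C = \mathrm{Cov}(h(X))$ also carries a contribution from the mean difference, so the number is only illustrative. The function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[pivot] = m[pivot], m[k]
        for r in range(k + 1, n):
            factor = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= factor * m[k][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def kappa(cov, active):
    """kappa = || C_{A^c A} (C_{AA})^{-1} ||_inf, the maximum absolute row sum."""
    p = len(cov)
    inactive = [j for j in range(p) if j not in active]
    c_aa = [[cov[i][j] for j in active] for i in active]
    worst = 0.0
    for i in inactive:
        # Row i of C_{A^c A} C_AA^{-1} is the solution w of C_AA w = C_{A,i},
        # because C (and hence C_AA) is symmetric.
        w = solve(c_aa, [cov[j][i] for j in active])
        worst = max(worst, sum(abs(v) for v in w))
    return worst


# Toy example (hypothetical): AR(0.5) matrix with the first two variables active.
ar = [[0.5 ** abs(i - j) for j in range(4)] for i in range(4)]
print(kappa(ar, active=[0, 1]))
```

A value below one means that no inactive variable is too strongly explained by the active ones, which is what the lasso needs in order to leave it out.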



Theorem 2. Define $\zeta_1$, $\zeta_2$ as in Corollary 1.



Pick any A such that A < min{ 



2(p 



, A}. Then 



for any $\epsilon > 0$ and sufficiently large $n$ such that $\epsilon > C s n^{-1/4}$, where $C$ does not depend on $(n, p, s)$,
we have



1. Assuming the condition in (18), with probability at least 1 — ipx> ft a = P/ P an d Pa c = 0, 



where 



^«4) + «(^a^: 



and e is any positive constant less than min < eo 



A(l-«) 



4<p (A/2 + (1 + k)A) 
2. With probability at least 1 — ■0 2 » none of the elements of /3a is zero, where 



?P 2 = 2s 2 ( 2 (-)+2sCi(e) 
s 



and e is any positive constant less than min < e 



Au 



(3 + v)(p'6 + 2v 

3. For any positive e satisfying e < min < e , t-, A >, we have 

{ 2ipA J 

Pr(||/3A-/3A|U<4^A) > l-2s 2 C 2 (-)- 



2<i(e) 



Theorem 2 provides the foundation for asymptotic results. Assume the following two regularity
conditions.

(C1). $n, p \to \infty$ and $\dfrac{s^2 \log(ps)}{n^{1/3 - \rho}} \to 0$, for some $\rho$ in $(0, 1/3)$;

(C2). $\min_{j \in A} |\beta^*_j| \gg \max\left\{ s n^{-1/4},\; \sqrt{\log(ps) \dfrac{s^2}{n^{1/3 - \rho}}} \right\}$ for some $\rho$ in $(0, 1/3)$.

Condition (C1) requires that $p$ and $s$ do not grow too fast compared with $n$. However, $p$ is
allowed to grow faster than any polynomial order of $n$. Condition (C2) states that the coefficients of
the important features should be sufficiently large that we can separate them from the noise, which is a
standard assumption in the literature of sparse recovery. The next theorem shows that HD-SeSDA
consistently recovers the Bayes rule of the SeLDA model.

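Theorem 3. Let $\hat A = \{j : \hat\beta_j \ne 0\}$. Under conditions (C1) and (C2), if we choose $\lambda = \lambda_n$ such that
$\lambda_n \ll \min_{j \in A} |\beta^*_j|$ and $\lambda_n \gg \sqrt{\log(ps) \dfrac{s^2}{n^{1/3 - \rho}}}$, and further assume $\kappa < 1$, then $\Pr(\hat A = A) \to 1$
and $\Pr\left(\|\hat\beta_A - \beta^*_A\|_\infty \le 4\varphi\lambda_n\right) \to 1$.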

Remark 3. Although penalized least squares is used for feature selection in HD-SeSDA, Theorems
2 and 3 are fundamentally different from the theoretical results of lasso penalized regression (Zhao
& Yu 2006, Wainwright 2009), because the previous work is built on the linear regression model
$Y = X\beta + \epsilon$ with $\epsilon$ being independent normal or sub-Gaussian, and this model is obviously not
true for $(Y^i, h(X^i))$ or $(Y^i, \hat h(X^i))$.

Further, we prove that HD-SeSDA is asymptotically equivalent to the Bayes rule in terms of
error rate as $n$ tends to infinity. Define the Bayes error rate $R = \Pr(Y \ne \mathrm{sign}(h(X)^T \beta^* + \beta_0))$
and $R_n = \Pr(Y \ne \mathrm{sign}(\hat h(X)^T \hat\beta + \hat\beta_0))$. Then we have the following theorem.

Theorem 4. Define $\zeta_1$, $\zeta_2$ as in Corollary 1.

for a sufficiently small constant $\epsilon > 0$ and sufficiently large $n$ such that $\epsilon > C s n^{-1/4}$, where $C$
does not depend on $(n, p, s)$, with probability no smaller than $1 - \psi_3$, we have $R_n - R < \epsilon$, where

^3 = CsU *, A J + CpCi{ Kl ~ K ^h + 2psU~) + 0(1). (19) 

s(0Ai + A 2 ) 4(1 + K) s 

Corollary 2. Under conditions (C1) and (C2), if we choose $\lambda = \lambda_n$ such that $\lambda_n \ll \min_{j \in A} |\beta^*_j|$



Pick any X such that X < mini — , A}. Then 

2cp 



and $\lambda_n \gg \sqrt{\log(ps) \dfrac{s^2}{n^{1/3 - \rho}}}$, and further assume $\kappa < 1$, then

$$R_n - R \to 0 \text{ in probability.} \tag{20}$$

Remark 4. Our results concerning the error rate of HD-SeSDA are much more involved than those
for sparse LDA algorithms in Cai & Liu (2011) and Fan et al. (2012), because of the semiparametric
assumptions. Under the parametric LDA model, the error rate tends to the Bayes error as long as
the discriminant direction $\beta$ is estimated consistently. However, under the SeLDA model, we deal
with the extra uncertainty in estimating $h$ and need some uniform convergence results on $\hat h(X)$.
5 Numerical Results



5.1 Simulation 



We examine the finite sample performance of HD-SeSDA by simulation. For comparison, in the
simulation study we also include DSDA and the sparse LDA algorithm (Witten & Tibshirani 2011),
denoted by Witten for presentation purposes. After we apply the estimated transformation to the
data, we use Witten's sparse LDA algorithm to fit the classifier. This gives us Se-Witten, another
competitor in the simulation study.

Four types of SeLDA models were considered in the study. In each model, we first generated
$Y$ with $\pi_+ = \pi_- = 0.5$. We fixed $\mu_- = 0$ and $\mu_+ = \Sigma\beta^{\mathrm{Bayes}}$.

Model 1: $n = 150$, $p = 400$. $\Sigma$ has AR(0.5) structure.
$\beta^{\mathrm{Bayes}} = 0.556(3, 1.5, 0, 0, 2, 0_{p-5})^T$.

Model 2: $n = 200$, $p = 400$. $\Sigma$ has AR(0.5) structure.
$\beta^{\mathrm{Bayes}} = 0.582(3, 2.5, -2.8, 0_{p-3})^T$.

Model 3: $n = 400$, $p = 800$. $\Sigma$ has CS(0.5) structure.
$\beta^{\mathrm{Bayes}} = 0.395(3, 1.7, -2.2, -2.1, 2.55, 0_{p-5})^T$.

Model 4: $n = 300$, $p = 800$. $\Sigma$ is block diagonal with 5 blocks of dimension $160 \times 160$.
Each block has CS(0.6) structure.
$\beta^{\mathrm{Bayes}} = 0.916(1.2, -1.4, 1.15, -1.64, 1.5, -1, 2, 0_{p-7})^T$.

We transform $V$ to $X$ by $X = g(V)$ and the final data to be used are $(X, Y)$. In each type of
model, we consider two sets of $g$. We call the resulting models series a and b. In series a, $X = V$
so that the SeLDA model becomes the LDA model. In series b, we considered some commonly
used transformations such that some features become heavily skewed, some heavy-tailed and
some bounded. The choices of $g$ are listed in Table 1. In the simulation study we also considered
the oracle sparse discriminant classifiers including oracle DSDA and oracle Witten. The idea is
to apply the true transformation to variables and then fit a sparse LDA classifier using DSDA or
Witten and Tibshirani's method.

Table 1: Choices of $g_j$ in Models 1b-4b.

$g_j(v)$         Models 1b, 2b         Model 3b              Model 4b
                 $j$                   $j$                   $j$
$v^3$            1, 101, ..., 150      1, 201, ..., 300      3, 201, ..., 300
$\exp(v)$        2, 151, ..., 200      2, 301, ..., 400      4, 301, ..., 400
$\arctan(v)$     3, 201, ..., 300      3, 401, ..., 500      5, 401, ..., 500
$v^3$            4, ..., 50            4, 6, ..., 100        1, 8, ..., 100
$\Phi(v)$        51, ..., 100          5, 101, ..., 200      2, 101, ..., 200
$(v + 1)^3$      301, ..., 350         501, ..., 600         6, 501, ..., 600
$\arctan(2v)$    351, ..., 400         601, ..., 800         7, 601, ..., 800
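The data-generating scheme for the AR-structured models can be sketched as follows. This is a scaled-down illustration under stated assumptions, not the authors' simulation code: it uses $p = 10$ instead of 400, the leading coefficients of Model 1, and a hypothetical assignment of three monotone transformations to the first three coordinates rather than the index sets of Table 1. The names `ar1_mean_shift`, `ar1_noise` and `simulate` are invented for this example.

```python
import math
import random


def ar1_mean_shift(beta, rho):
    """mu_+ = Sigma beta for Sigma_{jk} = rho^{|j-k|}."""
    p = len(beta)
    return [sum(rho ** abs(j - k) * beta[k] for k in range(p)) for j in range(p)]


def ar1_noise(p, rho, rng):
    """One N(0, Sigma) draw with AR(rho) correlation via the stationary AR(1) recursion."""
    z = [rng.gauss(0.0, 1.0)]
    for _ in range(p - 1):
        z.append(rho * z[-1] + math.sqrt(1 - rho * rho) * rng.gauss(0.0, 1.0))
    return z


def simulate(n, beta, rho, transforms, seed=0):
    """Return a list of (y, v, x): label, latent Gaussian vector, observed x = g(v)."""
    rng = random.Random(seed)
    p = len(beta)
    mu_plus = ar1_mean_shift(beta, rho)  # mu_- is fixed at zero
    data = []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else -1  # pi_+ = pi_- = 0.5
        noise = ar1_noise(p, rho, rng)
        v = [e + (m if y == 1 else 0.0) for e, m in zip(noise, mu_plus)]
        x = [g(vj) for g, vj in zip(transforms, v)]
        data.append((y, v, x))
    return data


p = 10
beta = [0.556 * b for b in [3, 1.5, 0, 0, 2] + [0] * (p - 5)]
# Hypothetical series-b style transformations for the first three coordinates.
transforms = [lambda v: v ** 3, math.exp, math.atan] + [lambda v: v] * (p - 3)
data = simulate(150, beta, 0.5, transforms)
print(len(data), data[0][0])
```

Because every $g_j$ is strictly increasing, the ranks of each observed coordinate agree with those of the latent Gaussian coordinate, which is why an estimator built on the empirical CDF can undo the transformation.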

The simulation results for Models 1a-4a and Models 1b-4b are reported in Table 2 and Table
3, respectively. Note that in Table 2 DSDA and Witten are the oracle DSDA and the oracle Witten.
We can draw the following conclusions from Tables 2 and 3.

• Models 1a-4a are actually LDA models. HD-SeSDA performs very similarly to DSDA.
Although HD-SeSDA has slightly higher error rates, this is expected because HD-SeSDA
does not use the parametric assumption. On the other hand, in Models 1b-4b, HD-SeSDA
performs much better than DSDA. These results jointly show that HD-SeSDA is a much
more robust sparse discriminant analysis algorithm than those based on the LDA model.

• In both tables, HD-SeSDA is very close to the oracle DSDA, which empirically shows the
high quality of the proposed transformation estimator in Section 3.2. In all eight cases,
HD-SeSDA is a good approximation to the Bayes rule, which is consistent with the theoretical
results.

• Se-Witten is a different HD-SeSDA classifier in which Witten and Tibshirani's method is
used to fit the SeLDA model after estimating the transformation functions. Se-Witten
performs very well in Models 1a, 2a, 1b, 2b but it performs very poorly in Models 3a, 4a, 3b, 4b.
The same is true for the oracle Witten method. By comparing HD-SeSDA and Se-Witten, we
see that DSDA works better than Witten and Tibshirani's method. In addition to the theory
in Section 4, the simulation also supports the use of DSDA in fitting the high-dimensional
sparse semiparametric LDA model.

5.2 Malaria data 



We further demonstrate HD-SeSDA by using the malaria data (Ockenhouse et al. 2006). This 
dataset is available at 

\protect \vrule width Opt \protect \href {http : //www . ncbi . nlm. nih . gov/sites/GDSb: 

Out of 71 samples in the dataset, 49 have been infected with malaria, while 22 are healthy people.
The predictors are the expression levels of 22283 genes. The 71 samples were split with a 2:1 ratio to
form training and testing sets. We report the median of 100 replicates in Table 4. Besides DSDA,
$\ell_1$ logistic regression (Friedman et al. 2008) was also considered because it is an obvious choice
for sparse high-dimensional classification. From Table 4, it can be seen that HD-SeSDA is slightly
more accurate than DSDA and $\ell_1$ logistic regression. In addition, HD-SeSDA selects 6 genes,
while the other two methods select about 22 genes.
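
The evaluation protocol described above (random 2:1 split, 100 replicates, median test error) can be sketched as follows. This is only an illustration under stated assumptions: `fit_predict` is a hypothetical callable standing in for any of the three classifiers, the test set size is taken to be 71 // 3 = 23 (matching the denominators in Table 4), the split is assumed to be unstratified, and the majority-vote rule and constant features are placeholders rather than the real data or methods.

```python
import random
import statistics

def repeated_split_error(X, y, fit_predict, n_rep=100, seed=0):
    """Median number of test errors over random 2:1 train/test splits.

    fit_predict(X_train, y_train, X_test) -> list of predicted labels.
    It is a hypothetical interface, not the authors' code.
    """
    rng = random.Random(seed)
    n = len(y)
    n_test = n // 3  # assumption: 71 samples -> 23 test, 48 training
    errors = []
    for _ in range(n_rep):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        pred = fit_predict([X[i] for i in train],
                           [y[i] for i in train],
                           [X[i] for i in test])
        errors.append(sum(p != y[i] for p, i in zip(pred, test)))
    return statistics.median(errors), n_test

def majority_rule(X_train, y_train, X_test):
    """Placeholder classifier: always predict the majority training label."""
    label = max(set(y_train), key=y_train.count)
    return [label] * len(X_test)

# Toy labels with the same class sizes as the malaria data (49 vs 22).
y = [1] * 49 + [-1] * 22
X = [[0.0] for _ in y]  # dummy features, ignored by the placeholder rule
median_err, n_test = repeated_split_error(X, y, majority_rule)
print(f"median test errors: {median_err} out of {n_test}")
```

Reporting the median rather than the mean makes the summary less sensitive to an occasional unlucky split, which matters with only 23 test samples.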

To gain more insight, we compared the genes selected by HD-SeSDA with those selected by DSDA or
$\ell_1$ logistic regression. Across the 100 replicates, the 2059th gene is the one most frequently
selected by HD-SeSDA, but it is seldom selected by DSDA or $\ell_1$ logistic regression. This gene is
IRF1, so named because it is the first identified interferon regulatory transcription factor
(http://en.wikipedia.org/wiki/IRF1). Discovering the role of IRF1 was a major finding in
Ockenhouse et al. (2006). Previous studies show that IRF1 influences the immune response.
Therefore, healthy and sick people may have different expression levels of this gene. It is very
interesting that a purely statistical method like HD-SeSDA selects IRF1. We plot in Figure 1 the
within-group density functions of gene IRF1 (the 2059th gene). It can be seen that the raw
expression levels of IRF1 are skewed, making linear rules unreliable on this gene. After applying
the estimated transformation, the distributions of both groups become close to normal, with
similar variances. The transformation separates the
Table 2: Simulation results for Models 1a-4a. The reported numbers are medians based on 2000
replications. Their standard errors obtained by bootstrap are in parentheses. TRUE selection and
FALSE selection denote the numbers of selected important variables and unimportant variables,
respectively.

                  Bayes  Oracle DSDA     HD-SeSDA        DSDA            Oracle Witten   Se-Witten       Witten
Model 1(a)
Error(%)          10     10.71 (0.02)    11.5 (0.03)     10.71 (0.02)    11.39 (0.02)    11.56 (0.01)    11.39 (0.02)
TRUE selection    3      3 (0)           3 (0)           3 (0)           3 (0)           3 (0)           3 (0)
FALSE selection          1 (0.14)        2 (0.38)        1 (0.14)        26 (0.42)       26 (0.09)       26 (0.42)
Model 2(a)
Error(%)          10     11.09 (0.02)    11.66 (0.03)    11.09 (0.02)    13.36 (0.03)    13.46 (0.04)    13.36 (0.03)
TRUE selection    3      3 (0)           3 (0)           3 (0)           3 (0)           3 (0)           3 (0)
FALSE selection          5 (0.37)        6 (0.51)        5 (0.37)        24 (0)          24 (0)          24 (0)
Model 3(a)
Error(%)          20     21.93 (0.03)    22.13 (0.03)    21.93 (0.03)    33.69 (0.01)    34.18 (0)       33.69 (0.01)
TRUE selection    5      5 (0)           5 (0)           5 (0)           3 (0)           5 (0)           3 (0)
FALSE selection          14 (0.59)       13 (0.57)       14 (0.59)       419.5 (10.19)   795 (0)         419.5 (10.19)
Model 4(a)
Error(%)          10     12.50 (0.02)    13.20 (0.05)    12.50 (0.02)    23.90 (0.01)    26.14 (0.01)    23.90 (0.01)
TRUE selection    7      7 (0)           7 (0)           7 (0)           4 (0)           5 (0.02)        4 (0)
FALSE selection          18 (0.70)       17 (0.54)       18 (0.70)       35 (4.43)       153 (0)         35 (4.43)

Table 3: Simulation results for Models 1b-4b. The reported numbers are medians based on 2000
replications. Their standard errors obtained by bootstrap are in parentheses. TRUE selection and
FALSE selection denote the numbers of selected important variables and unimportant variables,
respectively.

                  Bayes  Oracle DSDA     HD-SeSDA        DSDA            Oracle Witten   Se-Witten       Witten
Model 1(b)
Error(%)          10     10.71 (0.02)    11.42 (0.04)    18.24 (0.10)    11.39 (0.01)    11.56 (0.02)    16.19 (0.05)
TRUE selection    3      3 (0)           3 (0)           3 (0)           3 (0)           3 (0)           3 (0)
FALSE selection          1 (0.14)        2 (0.42)        2 (0)           26 (0.42)       26 (0.09)       25 (0.50)
Model 2(b)
Error(%)          10     11.09 (0.02)    11.66 (0.03)    19.47 (0.09)    13.36 (0.03)    13.46 (0.03)    20.16 (0.04)
TRUE selection    3      3 (0)           3 (0)           3 (0)           3 (0)           3 (0)           2 (0)
FALSE selection          5 (0.37)        6 (0.51)        2 (0)           24 (0)          24 (0.32)       20 (0.17)
Model 3(b)
Error(%)          20     21.93 (0.03)    22.13 (0.03)    26.76 (0.03)    33.69 (0.01)    34.18 (0)       34.25 (0)
TRUE selection    5      5 (0)           5 (0)           5 (0)           3 (0)           5 (0)           5 (0)
FALSE selection          14 (0.59)       13 (0.57)       15 (0.67)       419.5 (10.19)   795 (0)         795 (0)
Model 4(b)
Error(%)          10     12.50 (0.02)    13.4 (0.03)     19.88 (0.04)    23.90 (0.01)    26.14 (0.01)    26.83 (0.01)
TRUE selection    7      7 (0)           7 (0)           6 (0)           4 (0)           5 (0.02)        6 (0.23)
FALSE selection          18 (0.70)       17 (0.54)       25 (0.83)       35 (4.43)       153 (0)         153 (0.09)

Table 4: Comparison of HD-SeSDA, DSDA and $\ell_1$ logistic regression on the malaria dataset. The
reported numbers are medians of 100 replicates, with standard errors obtained by bootstrap in
parentheses.

                     HD-SeSDA     DSDA            Logistic
Testing Error        1/23 (0%)    2/23 (2.06%)    2/23 (0.17%)
Fitted Model Size    6 (0.86)     22.5 (1.76)     23 (0.83)



two groups farther apart from each other, which helps explain the more accurate classification by 
HD-SeSDA. 

6 Discussion 



Developing sparse discriminant analysis for high-dimensional classification and feature selection
has been an active research topic in recent years, rejuvenating traditional discriminant analysis.
However, sparse discriminant algorithms based on the LDA model can be very ineffective for
non-normal data, as shown in the simulation study. To overcome the normality limitation, we consider
the semiparametric discriminant analysis model and propose HD-SeSDA, a high-dimensional
semiparametric sparse discriminant classifier. We have justified HD-SeSDA both theoretically
and empirically. For high-dimensional classification and feature selection, HD-SeSDA is more
appropriate than the existing sparse discriminant analysis proposals in the literature.

Theorem 1 is at the core of our theoretical analysis of HD-SeSDA. As commented in Remark 1,
one could also build semiparametric versions of LPD by Cai & Liu (2011) or ROAD by Fan
et al. (2012) after using Theorem 1 to transform the data. The resulting new semiparametric sparse
discriminant analysis techniques should have nice theoretical properties as well, just like
HD-SeSDA. With Theorem 1 in hand, the rest of the analysis is largely in line with those in
Cai & Liu (2011) and Fan et al. (2012). Due to space considerations and for clarity of presentation,
we do not analyze these two methods in detail in the present paper.
[Figure 1: two density plots. Left panel: "Density for Raw Data"; right panel: "Density for
Transformed Data". Horizontal axis in both panels: gene expression level for IRF1.]

Figure 1: Density functions of gene IRF1 (the 2059th gene) in the malaria data. The plot on the
left displays the density function of the normalized raw data, while the one on the right is of the
transformed data.

Appendix: proofs 

The following properties of the normal distribution are repeatedly used in our proof.

Proposition 1. Let $\phi(t)$ and $\Phi(t)$ be the pdf and CDF of $N(0, 1)$.

1. For $t \ge 1$,
\[ \frac{\phi(t)}{2t} \le 1 - \Phi(t) \le \frac{\phi(t)}{t}. \]

2. For $t \ge 0.99$,
\[ \Phi^{-1}(t) \le \sqrt{2 \log \frac{1}{1-t}}. \]

Proposition 1 is an elementary and classic result in probability and hence its proof is omitted
for the sake of space.
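
As a quick sanity check of the two inequalities in Proposition 1, the following sketch evaluates both sides numerically at a few points using only the Python standard library. The evaluation points are arbitrary illustrative choices, and a numerical check at finitely many points is of course not a proof.

```python
import math
from statistics import NormalDist

std = NormalDist()  # standard normal N(0, 1)

def mills_bounds_hold(t):
    """Check phi(t)/(2t) <= 1 - Phi(t) <= phi(t)/t at a single t >= 1."""
    tail = 1.0 - std.cdf(t)
    return std.pdf(t) / (2 * t) <= tail <= std.pdf(t) / t

def quantile_bound_holds(u):
    """Check Phi^{-1}(u) <= sqrt(2 log(1/(1-u))) at a single u in [0.99, 1)."""
    return std.inv_cdf(u) <= math.sqrt(2 * math.log(1 / (1 - u)))

ts = [1.0, 1.5, 2.0, 3.0, 5.0]   # illustrative points for part 1
us = [0.99, 0.999, 0.999999]     # illustrative points for part 2
mills_ok = all(mills_bounds_hold(t) for t in ts)
quantile_ok = all(quantile_bound_holds(u) for u in us)
print(mills_ok, quantile_ok)
```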



To prove Theorem 1 we first study the accuracy of $\hat h_j$. The behavior of $\hat h_j = \Phi^{-1} \circ \hat F_j$
varies drastically on the real line. Define

\[ A_n = \left[-\sqrt{\gamma_1 \log n},\ \sqrt{\gamma_1 \log n}\right], \tag{21} \]

where $0 < \gamma_1 < 1$ is a fixed number and $n$ is the sample size. The following lemma shows that
$\hat h_j(x)$ is an accurate estimator of $h_j(x)$ for $h_j(x) \in A_n$.

Lemma 1. For sufficiently large $n$ and $0 < \gamma_1 < 1$, we have

\[ \Pr\Big( \sup_{h_j(x) \in A_n} |\hat h_j(x) - h_j(x)| > \epsilon \Big) \le 2\exp\Big(-\frac{n^{1-\gamma_1}\epsilon^2}{32\pi^2 \gamma_1 \log n}\Big) + 2\exp\Big(-\frac{n^{1-\gamma_1}}{16\pi \gamma_1 \log n}\Big). \]

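Proof of Lemma 1. By the mean value theorem,

\[ \hat h_j(x) - h_j(x) = (\Phi^{-1})'(\xi)\,(\hat F_j(x) - F_j(x)), \]

for some $\xi \in [\min(\hat F_j(x), F_j(x)), \max(\hat F_j(x), F_j(x))]$.

First, we bound $|(\Phi^{-1})'(\xi)|$. This is achieved by bounding $F_j(x)$ and $\hat F_j(x)$. By definition, for
any $h_j(x) \in A_n$,

\[ \frac{n^{-\gamma_1/2}}{2\sqrt{2\pi\gamma_1 \log n}} \le \Phi(-\sqrt{\gamma_1 \log n}) \le F_j(x) \le \Phi(\sqrt{\gamma_1 \log n}) \le 1 - \frac{n^{-\gamma_1/2}}{2\sqrt{2\pi\gamma_1 \log n}}. \]

On the other hand, for $x$ such that $h_j(x) \in A_n$,

\[ \Pr\Big( \frac{n^{-\gamma_1/2}}{4\sqrt{2\pi\gamma_1 \log n}} \le \hat F_j(x) \le 1 - \frac{n^{-\gamma_1/2}}{4\sqrt{2\pi\gamma_1 \log n}} \Big)
\ge \Pr\Big( \sup_{h_j(x) \in A_n} |\tilde F_j(x) - F_j(x)| \le \frac{n^{-\gamma_1/2}}{4\sqrt{2\pi\gamma_1 \log n}} \Big)
\ge 1 - 2\exp\Big(-\frac{n^{1-\gamma_1}}{16\pi\gamma_1 \log n}\Big), \]

where the last inequality follows from the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality.
Consequently, with a probability no less than $1 - 2\exp\big(-\frac{n^{1-\gamma_1}}{16\pi\gamma_1 \log n}\big)$,

\[ \frac{n^{-\gamma_1/2}}{4\sqrt{2\pi\gamma_1 \log n}} \le \xi \le 1 - \frac{n^{-\gamma_1/2}}{4\sqrt{2\pi\gamma_1 \log n}}, \]

and, combining this fact with Proposition 1, we have

\[ |(\Phi^{-1})'(\xi)| = \frac{1}{\phi(\Phi^{-1}(\xi))} = \sqrt{2\pi}\exp\Big(\frac{(\Phi^{-1}(\xi))^2}{2}\Big)
\le \sqrt{2\pi}\exp\Big(\log\big(4 n^{\gamma_1/2}\sqrt{2\pi\gamma_1 \log n}\big)\Big)
= 8\pi n^{\gamma_1/2}\sqrt{\gamma_1 \log n} \equiv M_n. \]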
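Then

\[ \Pr\Big( \sup_{h_j(x) \in A_n} |\hat h_j(x) - h_j(x)| > \epsilon \Big)
\le \Pr\Big( M_n \sup_{h_j(x) \in A_n} |\hat F_j(x) - F_j(x)| > \epsilon \Big) + 2\exp\Big(-\frac{n^{1-\gamma_1}}{16\pi\gamma_1 \log n}\Big). \]

For the first term on the right hand side,

\[ \Pr\Big( M_n \sup_{h_j(x) \in A_n} |\hat F_j(x) - F_j(x)| > \epsilon \Big)
\le \Pr\Big( M_n \sup_{h_j(x) \in A_n} |\hat F_j(x) - \tilde F_j(x)| > \frac{\epsilon}{2} \Big) + \Pr\Big( M_n \sup_{h_j(x) \in A_n} |\tilde F_j(x) - F_j(x)| > \frac{\epsilon}{2} \Big). \]

Because $\sup_{h_j(x) \in A_n} |\hat F_j(x) - \tilde F_j(x)| \le \delta_n = \frac{1}{n^2}$, we have $\delta_n M_n \to 0$, and so the first term is 0 for
sufficiently large $n$. Apply the DKW inequality to the second term and the conclusion follows. □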
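
To make the object studied in Lemma 1 concrete, here is a minimal sketch of a transformation estimate of the form $\Phi^{-1}$ composed with a truncated empirical CDF. It assumes the truncation clips the empirical CDF to $[1/n^2, 1 - 1/n^2]$, which is inferred from the value of $\delta_n$ used in the proofs; the function name `make_transform` and the lognormal toy data are illustrative choices, not part of the paper.

```python
import bisect
import math
import random
from statistics import NormalDist

def make_transform(sample):
    """Return x -> Phi^{-1}(F(x)), where F is the empirical CDF of `sample`
    clipped to [1/n^2, 1 - 1/n^2] (assumed truncation level)."""
    xs = sorted(sample)
    n = len(xs)
    delta = 1.0 / n ** 2
    inv_cdf = NormalDist().inv_cdf

    def h_hat(x):
        ecdf = bisect.bisect_right(xs, x) / n          # empirical CDF at x
        clipped = min(max(ecdf, delta), 1.0 - delta)   # keep away from 0 and 1
        return inv_cdf(clipped)

    return h_hat

# Toy data: X = exp(Z) with Z ~ N(0, 1), so the true transformation is log(x).
rng = random.Random(1)
n = 200
sample = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(n)]
h_hat = make_transform(sample)
vals = [h_hat(x) for x in sorted(sample)]

# The clipping bounds the estimate by Phi^{-1}(1 - 1/n^2) <= 2 sqrt(log n).
print(max(abs(v) for v in vals), 2 * math.sqrt(math.log(n)))
print(h_hat(1.0))  # should be near log(1) = 0
```

The clipping is what keeps the estimate finite at the extreme order statistics, and the bound $2\sqrt{\log n}$ printed above is the same quantity that appears in (22) and (23).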

The above lemma guarantees that $\hat h_j(X_j)$ is very close to $h_j(X_j)$ on $A_n$. Now we consider
observations in $A_n^c$. Because such observations are relatively few, their influence is limited in
estimating $\mu_{yj}$ and $\Sigma_{jk}$. Partition $A_n^c$ into three regions:

\[ B_n = [-\gamma_2 \log n, -\sqrt{\gamma_1 \log n}) \cup (\sqrt{\gamma_1 \log n}, \gamma_2 \log n]; \]
\[ C_n = [-n^{\gamma_3}, -\gamma_2 \log n) \cup (\gamma_2 \log n, n^{\gamma_3}]; \]
\[ D_n = (-\infty, -n^{\gamma_3}) \cup (n^{\gamma_3}, \infty). \]

Define $\#B_n = \#\{i : h_j(X_j^i) \in B_n\}$ and $\#C_n$, $\#D_n$ analogously. We have the following lemma.

Lemma 2. For sufficiently large $n$ and positive constants $\alpha_1, \alpha_2$ such that $\alpha_1 > 1 - \frac{\gamma_1}{2}$, we have

\[ \sup_{h_j(x) \in B_n} |\hat h_j(x) - h_j(x)| \le 2\sqrt{\log n} + \gamma_2 \log n; \tag{22} \]

\[ \sup_{h_j(x) \in C_n} |\hat h_j(x) - h_j(x)| \le 2\sqrt{\log n} + n^{\gamma_3}; \tag{23} \]

Pr(#5 n >n ai ) < e xp(-^--); (24) 

Pr(#C n >n a2 ) < exp(-^--); (25) 

\[ \Pr(\#D_n \ge 1) \le \frac{2n^{1-\gamma_3}}{\sqrt{2\pi}} \exp\Big(-\frac{n^{2\gamma_3}}{2}\Big). \tag{26} \]
Proof of Lemma 2. Equations (22)-(23) are direct consequences of the definitions of $\hat h_j$ and $B_n$,
$C_n$. Indeed, because $\hat F_j \le 1 - \delta_n$, by Proposition 1, for $x \in B_n \cup C_n$,

\[ |\hat h_j(x)| \le \Phi^{-1}(1 - \delta_n) \le \sqrt{2 \log \frac{1}{\delta_n}} = 2\sqrt{\log n}. \]

Combining this bound with the definitions of $B_n$, $C_n$, we have the desired conclusions.

For (24), note that, for sufficiently large $n$,

\[ \Pr(h_j(X_j) \in B_n) \le 2\Pr(h_j(X_j) > \sqrt{\gamma_1 \log n}) \le \frac{\sqrt{2}\, n^{-\gamma_1/2}}{\sqrt{\pi\gamma_1 \log n}} \le n^{-\gamma_1/2}. \]




Therefore, by Hoeffding's inequality 



Pr(#£ n >n Ql ) 

n 

< Pr(^[(/(^(Xj) G B n )) - Pr(/i i (X;) G B n )] > n ai - n 1 ' 



21 

2 



8=1 



< exp( 



< exp( 



n 



1 — n 2 



n 



2ai-l 



for sufficiently large n. 
For p5] >, note that 



Pr(/i i (X;) G C n ) < 



72 log n 



So ( 25 ) can be proven similarly. 



For (26),

\[ \Pr(\#D_n \ge 1) \le 2n\Pr(h_j(X_j) > n^{\gamma_3}) \le \frac{2n^{1-\gamma_3}}{\sqrt{2\pi}} \exp\Big(-\frac{n^{2\gamma_3}}{2}\Big). \]

□
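
The partition of the real line into $A_n$, $B_n$, $C_n$ and $D_n$, together with the bound in (26), can be illustrated numerically. In the sketch below the values $n = 1000$, $\gamma_1 = 0.5$, $\gamma_2 = 1$ and $\gamma_3 = 0.4$ are arbitrary illustrative choices (not the constants selected in the proof of Theorem 1), and the helper names `region` and `d_bound` are hypothetical.

```python
import math

def region(h, n, g1, g2, g3):
    """Label (A, B, C or D) of the region containing the value h."""
    a = math.sqrt(g1 * math.log(n))  # boundary of A_n
    b = g2 * math.log(n)             # outer boundary of B_n
    c = n ** g3                      # outer boundary of C_n
    t = abs(h)                       # all four regions are symmetric about 0
    if t <= a:
        return "A"
    if t <= b:
        return "B"
    if t <= c:
        return "C"
    return "D"

def d_bound(n, g3):
    """Right-hand side of (26): bound on Pr(#D_n >= 1)."""
    return (2 * n ** (1 - g3) / math.sqrt(2 * math.pi)
            * math.exp(-n ** (2 * g3) / 2))

n, g1, g2, g3 = 1000, 0.5, 1.0, 0.4  # illustrative values only
labels = [region(h, n, g1, g2, g3) for h in (0.5, -3.0, 10.0, -20.0)]
print(labels)
print(d_bound(n, g3))
```

With these values the boundaries are roughly 1.86, 6.91 and 15.85, and the bound in (26) is already astronomically small, which is why observations in $D_n$ can be ignored with overwhelming probability.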



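Proof of Theorem 1. We first prove (14).

\[ \Pr(|\hat\mu_j - \mu_j| > \epsilon) \le \Pr\Big( \frac{1}{n}\sum_{i=1}^n |\hat h_j(X_j^i) - h_j(X_j^i)| > \frac{\epsilon}{2} \Big) + \Pr\Big( \Big|\frac{1}{n}\sum_{i=1}^n h_j(X_j^i) - \mu_j\Big| > \frac{\epsilon}{2} \Big) \equiv L_1 + L_2. \]

By the Chernoff bound, $L_2 \le 2\exp(-Cn\epsilon^2)$. Moreover,

\[ L_1 \le \Pr\Big( \sup_{h_j(x) \in A_n} |\hat h_j(x) - h_j(x)| > \frac{\epsilon}{8} \Big) + \Pr\Big( \frac{\#B_n}{n} \sup_{h_j(x) \in B_n} |\hat h_j(x) - h_j(x)| > \frac{\epsilon}{8} \Big) \]
\[ \qquad + \Pr\Big( \frac{\#C_n}{n} \sup_{h_j(x) \in C_n} |\hat h_j(x) - h_j(x)| > \frac{\epsilon}{8} \Big) + \Pr\Big( \frac{\#D_n}{n} \sup_{h_j(x) \in D_n} |\hat h_j(x) - h_j(x)| > \frac{\epsilon}{8} \Big). \]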

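By Lemma 2, it can be checked that, under Condition (C1), if $\#B_n \le n^{\alpha_1}$ and $\#D_n = 0$, then

\[ \Pr\Big( \frac{\#B_n}{n} \sup_{h_j(x) \in B_n} |\hat h_j(x) - h_j(x)| > \frac{\epsilon}{8} \Big) = 0, \qquad \Pr\Big( \frac{\#D_n}{n} \sup_{h_j(x) \in D_n} |\hat h_j(x) - h_j(x)| > \frac{\epsilon}{8} \Big) = 0, \]

for sufficiently large $n$. If $\gamma_3 + \alpha_2 < 1$, similarly we have

\[ \Pr\Big( \frac{\#C_n}{n} \sup_{h_j(x) \in C_n} |\hat h_j(x) - h_j(x)| > \frac{\epsilon}{8} \Big) = 0. \]

It follows that, if $\alpha_1 < 1$ and $\gamma_3 + \alpha_2 < 1$, then we have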



2n 1 "^ 



n 



273 



Li < 4exp(-Cn 1 - 71 — ) + exp(-Cn 2 ° 1 - 1 ) + exp(-Cn 2 ° 2 - 1 ) + expi 

7i V27T 2 



3 10 

Take 7i = #,ai = l — -,a 2 = T — t:,73 = t — 77 and the conclusion follows. 



Now we prove ( 15 ). By the proof in Liu et al. (2009), it suffices to bound 

1 n 
Pr{\-Y l h j {X*){h k {X i k )-h k {X i k ))\>e). 

8=1 

We can decompose the summation into four terms. 

1 n 

-J2hAx;)(h k (xi)-h k (xi)) 



j=i 



1 

—1 

n 



E + E 

^■(Xj)6D n Or h k (Xl)eD n hj(XJ)<jtD n ,h k (Xi)£C n 

h j {X])&A n YJB n ,h k {Xl)eB n hj{X*)eA n ,h k {Xl)&A n 
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Write #D nj = #{* : h 3 (Xj) E D n }. Then 

Pr(N>e) < Pr(# J D nj > 1) + Pr(#D nfc > 1) 
4n 1_73 . n 2ri . 

Note that, for a pair of a 2 , 73, such that a 2 + 273 — 1 < 0, we have n a2+2l3 ~ l — > 0. Therefore, for 
sufficiently large n, 



Pr(|S 2 |> e ) < Pr(- J2 MXi) - h k (Xl)\ > 



€ 






< Pr(#C n > n a2 ) + Pr(n Q2 - 1 (2 v/logn + n 73 ) > — ) 



n 2a 2 -l 



< exp( 4— ) + 0, 

Similarly, for < a% < 1, 



Pr(|S 3 |>e) < Pr(#fi n >ri Ql ) + Pr(n ai - 1 (7 2 logn)(2 v ^oi^ + 7 2 logn) >e) 

n 2ax-l 

< exp( ^— ) + 0, 

where < cui < 1. Finally, 

Pr(|5 4 |>e) < Pr( sup \h k {X{) - h k {X{)\ > * ) 

h k (Xi)eA n V7ilogn 



< 4exp(-C 



n x -^e 2 N 
7f log n 



Pick 7! = p, 73 = - — p, a 2 — - — -, Q!i = 1 — - and the conclusion follows. □ 

O Zi _ 



.Proo/"o/'Cor0//ary[7] Note that n.+ is a summation of n i.i.d random variables with distribution 
Bernoulli(l, 71+) . Therefore, by Cherno 
1 — 2exp(— en). Hence, by Theorem[lj 



Bernoulli (l,7r + ). Therefore, by Chernoff bound, there exists c > such that Pr(n + > — n) > 

Zi 



Similarly, 



Pr(|/i +j - a +j \ > -) < Ci(— 2 — ) + 2exp(-cn) 



Pr(|/}-i -//_,-! > |) < Ci*(^p) + 2exp(-cn). 



Hence, we have ( fl"6"] ). Equation ( fTTj ) can be proven similarly. D 
Proof of Theorem 2 and Theorem 3. By Mai et al. (2012), the consistency is implied by accurate
estimators of $\mu_y$ and $C_{ij}$. Therefore, Theorem 2 can be proven by following the proof in their paper
and applying Corollary 1. □

Theorem 3 is a direct consequence of Theorem 2. Hence, the proof is omitted here for the sake
of space.

Lemma 3. For any e < min{eo, , A} and large enough n such that e ^$> sn" 1 ' 4 ", we have 



Pr(||&4 - foil! > e) < 2s 2 C 2 (^) + 2<i(^ 

2. If we further assume that < c < 7r + , 7r_ < C, then 

Pr(|A,-A)| >Ce)<2exp(-Cn) + CsCi 
,\(l-K + 2e<f>) 



s(<j>A 1 + A, 



(27) 



(28) 



4(1 + «) sA 2 s 



Proo/ We first prove ( [27] ). Similar to the proof of Conclusion 3, Theorem 1 in |Mai et ah] ( |20 1 2[ ) , 
we have 



1 ,\ 



Pa\\i < ; t(„ + 011 (A+a - P-a) - {^+a - A*-a)||i + 0V A l) 

1 — r/10 2 



(29) 



where 771 = \\C A a - CaaIU- Under the events 771 < e and \\(jl + A - fa- a) - {n+A - H-a)\\i < e 



we have \\/3a — /?a||i < e- Hence, ( [27] ) follows. 
For ( [28] ), assume that ^o = 0. Then we have 



I A) - A> 



72+ 7T 2 (AlA + AlA) T /3A 1 

= log ^g— r 

7l_ 7Tl 2 

< I log 7T + — log 7T_ I + I log 7Tl — log 7Ti | 

+ K/iiA + ^2aY0a -Pa)\ + ^+a + H-a) t 0a - Pa) 



Under the events !#,- — 7r,-| < mini-, — }, \\uja — MmIIi < -rr- and IIA4 — /5a 111 < -1— , we have 

2 c " 0Ai A 2 



|/?o-/3o|<Ce. 



D 
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Proof of Theorem^ Note that 

R n < 1 - Pr(F = sign(h(X) T (3 + A,), sign(/i(X) T /3 + A,) = sign(h(X) T (3 + A)) 
< R + Pr(sign(/i(X) T /3 + A>) ^ sign(/i(X) T /3 + A,)) 



Therefore, 



Rn-R < Pr(sign(/ i (X) T /3 + /3 )^sign(/ i (X) T /3 + /3o)) (30) 

< Pr(|/i(X) T /3 + /3 |<e) (31) 

+ Pr(|(A(X) T /3 + A,) - {HXyp + /9b)| > |) 



Pr(|/ i (X) T /3 + /3 |<e)<^ (32) 

V Z7T 



Now 



For the second term, assume that (5 A c = 0, |/3 — A)| < Ce> 11/3.4 — A4II1 < —7^= and 

V log n 
e 



sup teAn |/ij(t) — hj(t)\ < C—— for all j, where A n is defined as in pTj ). Then 



|(/i(X A ) T /3 A + A,) - (/i(X A ) T /3 A + ft)! (33) 

< 0o - Ail + HfcpCOIUlA* - Ailli + ||MX A ) - ^(X^IUII^Hx (34) 

< \p -P \+2 v ^g~^\\p A -(3 A \\ 1 + <f ) A 1 \\h(X A )-h(X A )\U (35) 

which is smaller than e as long as hj(Xf) G A„ for all j. Therefore, take 71 = 1/2 in A n , we have 

e r y ST7 -1 / 4 

Pr(|(/i(X) T /3 + A) - (HXyp + A>)| > -) < Pr(U jeA ^-(X J ) G A n ) < -==, (36) 

2 A/log rz 

which will be smaller than e for sufficiently large n. 

Therefore, by Lemma[T} ( |T7] ), ( [28] ), we have the desired conclusion. □ 
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